ABSTRACT

Understanding the dynamic mechanisms that drive high-impact
scientific work (e.g., research papers, patents) is a
long-debated research topic with many important implications,
ranging from personal career development and recruitment
search to the allocation of research resources. Recent
advances in characterizing and modeling scientific success
have made it possible to forecast the long-term impact of
scientific work, where data mining techniques, supervised
learning in particular, play an essential role. Despite much
progress, several key algorithmic challenges in predicting
long-term scientific impact have largely remained open. In
this paper, we propose a joint predictive model to forecast
the long-term scientific impact at an early stage, which
simultaneously addresses a number of these open challenges,
including scholarly feature design, non-linearity, domain
heterogeneity and dynamics. In particular, we formulate it as
a regularized optimization problem and propose effective and
scalable algorithms to solve it. We perform extensive
empirical evaluations on large, real scholarly data sets to
validate the effectiveness and the efficiency of our method.

Categories and Subject Descriptors 

H.2.8 [Database Management]: Database applications - Data
mining

Keywords 
1. INTRODUCTION

Understanding the dynamic mechanisms that drive high-impact
scientific work (e.g., research papers, patents) is a
long-debated research topic with many important implications,
ranging from personal career development and recruitment
search to the allocation of research resources. Scholars,
especially junior scholars, who master the key to producing
high-impact work attract more attention as well as research
resources, and thus put themselves in a better position in
their career development. High-impact work remains one of the
most important criteria for various organizations (e.g.,
companies, universities and governments) to identify the best
talents, especially at their early stages. It is also highly
desirable for researchers to judiciously search for the right
literature that can best benefit their research.

Recent advances in characterizing and modeling scientific
success have made it possible to forecast the long-term impact
of scientific work. Wuchty et al. [28] observe that papers
with multiple authors receive more citations than
solo-authored ones. Uzzi et al. [26] find that the
highest-impact science is primarily grounded in atypical
combinations of prior ideas while embedding them in
conventional knowledge frames. Recently, Wang et al. develop a
mechanistic model for the citation dynamics of individual
papers. In the data mining community, efforts have also been
made to predict long-term success. Carlos et al. [5] estimate
the number of citations of a paper based on information about
past articles written by the same author(s). Yan et al. [30]
design effective content (e.g., topic diversity) and
contextual (e.g., author's h-index) features for the
prediction of future citation counts. Despite much progress,
the following four key algorithmic challenges in predicting
long-term scientific impact have largely remained open.

C1 Scholarly feature design: many factors could affect the
long-term impact of scientific work, e.g., research topics,
author reputations, venue ranks, topological features of
citation networks, etc. Among them, which bears the most
predictive power?

C2 Non-linearity: the effect of the above scholarly features
on the long-term scientific impact might be far from a linear
relationship.

C3 Domain heterogeneity: the impact of scientific work in
different fields or domains might behave differently; yet some
closely related fields could still share certain
commonalities. Thus, a one-size-fits-all or one-size-fits-one
solution might be sub-optimal.

C4 Dynamics: with the rapid development of science and
engineering, a significant number of new research papers are
published each year, even on a daily basis with the advent of
arXiv. The predictive model needs to handle such stream-like
data efficiently, to reflect the recency of the scientific
work.
Table 1: Symbols



Figure 1: An illustrative example of the proposed joint
predictive model. Papers from the same domain (e.g., AI,
Databases, Data Mining and Bio) share similar patterns in
terms of attracting citations over time. Certain domains
(e.g., AI and Data Mining) are more related to each other than
other domains (e.g., AI and Bio). We want to jointly learn
four predictive models (one for each domain), with the goal of
encouraging the predictive models from more related domains
(e.g., AI and Data Mining) to be ‘similar’ to each other.


In this paper, we propose a joint predictive model, Impact
Crystal Ball (iBall for short), to forecast the long-term
scientific impact at an early stage by collectively addressing
the above four challenges. First (for C1), we find that the
citation history of a scholarly entity (e.g., paper,
researcher, venue) in its first three years (e.g., since its
publication date) is a strong indicator of its long-term
impact (e.g., the accumulated citation count in ten years),
and that adding contextual or content features brings little
marginal benefit in terms of prediction performance. This not
only largely simplifies the feature design, but also enables
us to forecast the long-term scientific impact at an early
stage. Second (for C2), our joint predictive model is
flexible, being able to characterize both linear and
non-linear relationships between the features and the impact
score. Third (for C3), we propose to jointly learn a
predictive model for each domain so as to differentiate
distinctive domains, while taking into consideration the
commonalities among similar domains (see an illustration in
Figure 1). Fourth (for C4), we further propose a fast on-line
update algorithm to adapt our joint predictive model
efficiently over time to accommodate newly arrived training
examples (e.g., newly published papers).

Our main contributions can be summarized as follows: 

• Algorithms: we propose a joint predictive model, iBall, for
the long-term scientific impact prediction problem, together
with its efficient solvers.

• Proofs and analysis: we analyze the correctness, the
approximation quality and the complexity of our proposed
algorithms.

• Empirical evaluations: we conduct extensive experiments to
demonstrate the effectiveness and efficiency of our proposed
algorithms.

The rest of the paper is organized as follows. Section 2 gives
the problem definition. Section 3 provides empirical
observations on the AMiner citation network dataset. Section 4
proposes our joint model and the fast algorithm. Section 5
shows the experimental results. Section 6 reviews related
work, and the paper then concludes.

Symbols            Definition
$n_d$              number of domains
$n_i$              number of training samples in the i-th domain
$m_i$              number of new training samples in the i-th domain
$d$                feature dimensionality
$X_t^{(i)}$        feature matrix of training samples from the i-th domain at time t
$x_{t+1}^{(i)}$    feature matrix of new training samples from the i-th domain at time t + 1
$Y_t^{(i)}$        impact vector of training samples from the i-th domain at time t
$y_{t+1}^{(i)}$    impact vector of new training samples from the i-th domain at time t + 1
$A$                adjacency matrix of domain relation graph
$w^{(i)}$          model parameter for the i-th domain
$K^{(i)}$          kernel matrix of training samples in the i-th domain
$K^{(ij)}$         cross-domain kernel matrix of training samples in the i-th and j-th domains

2. PROBLEM STATEMENT 

In this section, we first present the notations used
throughout the paper and then formally define the long-term
scientific impact prediction problem for scholarly entities
(e.g., research papers, researchers, conferences).

Table 1 lists the main symbols used throughout the paper. We
use bold capital letters (e.g., $A$) for matrices, bold
lowercase letters (e.g., $w$) for vectors, and lowercase
letters (e.g., $\lambda$) for scalars. For matrix indexing, we
use a convention similar to Matlab: $A(i,j)$ denotes the entry
at the i-th row and j-th column of a matrix $A$, $A(i,:)$
denotes the i-th row of $A$ and $A(:,j)$ denotes the j-th
column of $A$. Besides, we use prime for matrix transpose,
e.g., $A'$ is the transpose of $A$.

To differentiate samples from different domains at different
time steps, we use a superscript to index the domain and a
subscript to indicate the timestamp. For instance, $X_t^{(i)}$
denotes the feature matrix of all the scholarly entities in
the i-th domain at time t and $x_{t+1}^{(i)}$ denotes the
feature matrix of new scholarly entities in the i-th domain at
time t + 1. Hence, $X_{t+1}^{(i)} = [X_t^{(i)}; x_{t+1}^{(i)}]$.
Similarly, $Y_t^{(i)}$ denotes the impact vector of scholarly
entities in the i-th domain at time t and $y_{t+1}^{(i)}$
denotes the impact vector of new scholarly entities in the
i-th domain at time t + 1. Hence,
$Y_{t+1}^{(i)} = [Y_t^{(i)}; y_{t+1}^{(i)}]$. We will omit the
superscript and/or subscript when the meaning of the matrix is
clear from the context.
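As a small illustration of the stacking notation, the Matlab-style $[X; x]$ corresponds to vertical concatenation. The sketch below uses NumPy and made-up toy numbers; the variable names are hypothetical and not part of the paper.

```python
import numpy as np

# Hypothetical toy data: 4 existing papers and 2 new ones, 3 features each.
X_t = np.arange(12, dtype=float).reshape(4, 3)   # X_t^{(i)}
x_new = np.ones((2, 3))                          # x_{t+1}^{(i)}
Y_t = np.array([5.0, 0.0, 12.0, 3.0])            # Y_t^{(i)}
y_new = np.array([1.0, 7.0])                     # y_{t+1}^{(i)}

# [A; B] in Matlab notation stacks B below A.
X_next = np.vstack([X_t, x_new])                 # X_{t+1}^{(i)}
Y_next = np.concatenate([Y_t, y_new])            # Y_{t+1}^{(i)}
```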

With the above notations, we are ready to define the long¬ 
term impact prediction problem in both static and dynamic 
settings as follows: 

Problem 1. Static Long-term Scientific Impact Prediction

Given: feature matrix $X$ and impact $Y$ of scholarly entities

Predict: the long-term impact of new scholarly entities

We further define the dynamic impact prediction problem as:

Problem 2. Dynamic Long-term Scientific Impact Prediction

Given: feature matrix $X_t$ and new training feature matrix
$x_{t+1}$ of scholarly entities, the impact vector $Y_t$, and
the impact vector of new training samples $y_{t+1}$

Predict: the long-term impact of new scholarly entities

3. EMPIRICAL OBSERVATIONS 

In this section, we perform an empirical analysis on the
AMiner citation network [23] to highlight some of the key
challenges summarized in the introduction. This is a rich,
real dataset for bibliographic network analysis and mining.
The dataset contains 2,243,976 papers, 1,274,360 authors, and
8,882 computer science venues. For each paper, the dataset
provides its title, authors, references, publication venue and
publication year. The papers date from 1936 to 2013. In total,
the dataset has 1,912,780 citation relationships extracted
from the ACM library.

3.1 Power-law distribution 

The distribution of the citation counts of all the papers and
the distribution of the number of citations received within 10
years after publication are presented in Figures 5a and 5b. We
also show the distributions of the citation counts of all the
authors and all the venues in Figures 5c and 5d, respectively.
It is clear that all these citation counts follow a power-law
distribution. Nearly 87.45% of the papers have zero citations
within 10 years.
(a) Distribution of the number of all citations of papers.

(b) Distribution of the number of citations papers received
within 10 years after publication (axis label: 10 years
citation).
3.2 Feature design

Prior work [3, 30] has proposed some effective features for
citation count prediction, e.g., topic features (topic rank,
diversity), author features (h-index, productivity) and venue
features (venue rank, venue centrality). Other work [27] makes
predictions only on the basis of the early years’ citation
data and finds that the future impact of the majority of
papers falls within the predicted citation range. We conduct
an experiment to compare the performance of different
features. Figure 2 shows the root mean squared error (RMSE)
using different features with a regression model for the
prediction of the 10-year citation count. For example, ‘3
years’ means using the first 3 years’ citations as features,
and ‘3 years + content’ means using the first 3 years’
citations along with content features (e.g., topic, author
features). The result shows that adding content features (the
right three bars in the figure) brings little improvement for
citation prediction.

3.3 Non-linearity 

To see whether the features have a linear relationship with
the citation count, we compare the performance of different
methods using only the first 3 years’ citation history. In
Figure 3, the non-linear models (iBall-fast, iBall-kernel,
Kernel-combine) all outperform the linear models
(iBall-linear, Linear-separate, Linear-combine). See Sections
4 and 5 for details of these models. It is clear that the
complex relationship between the features and the impact
cannot be well characterized by a simple linear model: the
prediction performance of all the linear models is even worse
than that of the baseline method (using the summation of the
first 3 years’ citation counts).

3.4 Domain heterogeneity 

To get a sense of the dynamic patterns of the citation count,
we construct a paper-age citation matrix $M$, where $M_{ij}$
indicates the number of citations the i-th paper receives in
the j-th year after it gets published. The matrix $M$ is then
factorized as $M \approx WH$ using Non-negative Matrix
Factorization (NMF) [14]. We visualize the first six rows of
$H$ in Figure 4, which reveal different clusters of citation
dynamics. As can be seen from the figure, the cyan line has a
very small peak in the first 3 years and then fades out very
quickly; the blue line picks up very fast in the early years
and then fades out; the yellow line indicates a delayed
pattern where the scientific work only receives some amount of
attention decades after it gets published. This highlights
that the impact of scientific work from different domains
behaves differently.
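The factorization step can be sketched as follows. This is a minimal illustration on synthetic data, assuming the multiplicative-update solver for the Frobenius objective (the text does not say which NMF solver was used); the function name `nmf`, the rank and the three synthetic patterns are all assumptions made for the example.

```python
import numpy as np

def nmf(M, rank, n_iter=200, seed=0, eps=1e-9):
    """Factorize a non-negative M (papers x ages) as W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    n, m = M.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates; they keep W and H non-negative.
        H *= (W.T @ M) / (W.T @ W @ H + eps)
        W *= (M @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic paper-age matrix built from 3 made-up citation patterns over 20 years.
rng = np.random.default_rng(1)
ages = np.arange(20)
patterns = np.stack([
    np.exp(-0.5 * ages),                     # small early peak, fast fade-out
    ages * np.exp(-0.3 * ages),              # picks up early, then fades
    np.exp(-0.5 * ((ages - 15) / 2) ** 2),   # delayed recognition
])
M = rng.random((50, 3)) @ patterns           # 50 papers, non-negative mixtures

W, H = nmf(M, rank=3)                        # rows of H are the learned patterns
rel_err = np.linalg.norm(M - W @ H) / np.linalg.norm(M)
```

Each row of `H` plays the role of one curve in Figure 4, and each row of `W` says how strongly a paper follows each pattern.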

4. PROPOSED ALGORITHMS 

In this section, we present our joint predictive model to
forecast the long-term scientific impact at an early stage. We
first formulate it as a regularized optimization problem, then
propose effective, scalable and adaptive algorithms, and
finally provide theoretical analysis in terms of the
optimality, the approximation quality and the computational
complexity.


(c) Distribution of the number of all citations of authors.

(d) Distribution of the number of all citations of venues.

Figure 5: Citation distributions of the AMiner citation
dataset.


4.1 iBall - Formulations 

Figure 2: Prediction error comparison with different features.

Figure 3: RMSE comparisons using different methods. The
citation count is normalized in this figure. See Section 5 for
normalization details.

Figure 4: Visualization of papers’ citation behavior (axis
label: Age). Different colors encode different citation
behaviors.

Our predictive model applies to different types of scholarly
entities (e.g., papers, researchers and venues). For the sake
of clarity, we will use paper citation prediction as an
example. As mentioned earlier, research papers are in general
from different domains. We want to jointly learn a predictive
model for each of the domains, with the design objective of
leveraging the commonalities between related domains. Here,
the commonalities among different domains are described by a
non-negative matrix $A$, i.e., if the i-th and j-th domains
are closely related, the corresponding entry $A_{ij}$ will
have a higher numerical value. Denote the feature matrix for
papers in the i-th domain by $X^{(i)}$, the citation counts of
papers in the i-th domain by $Y^{(i)}$ and the model parameter
for the i-th domain by $w^{(i)}$; we have the following joint
predictive model:

$$
\min_{w^{(i)},\, i=1,\dots,n_d} \; \sum_{i=1}^{n_d} \mathcal{L}\left[f(X^{(i)}, w^{(i)}), Y^{(i)}\right]
+ \theta \sum_{i=1}^{n_d} \sum_{j=1}^{n_d} A_{ij}\, g(w^{(i)}, w^{(j)})
+ \lambda \sum_{i=1}^{n_d} \Omega(w^{(i)}) \tag{1}
$$

where $f(X^{(i)}, w^{(i)})$ is the prediction function for the
i-th domain, $\mathcal{L}(\cdot)$ is a loss function,
$g(w^{(i)}, w^{(j)})$ characterizes the relationship between
the model parameters of the i-th and j-th domains,
$\Omega(w^{(i)})$ is the regularization term for the model
parameters, and $\theta$, $\lambda$ are regularization
parameters that balance the relative importance of each
aspect.

As can be seen, this formulation is quite flexible and
general. Depending on the loss function we use, our predictive
model can be formulated as a regression or a classification
task. Depending on the prediction function we use, we can have
either linear or non-linear models. The core of our joint
model is the second term, which relates the parameters of
different models. If $A_{ij}$ is large, meaning the i-th and
j-th domains are closely related to each other, we want the
function value $g(\cdot)$ that characterizes the relationship
between the parameters to be small.

iBall - linear formulation: if the feature and the output can
be characterized by a linear relationship, we can use a linear
function as the prediction function and the Euclidean distance
for the distance between model parameters. The linear model
can be formulated as follows:

$$
\min_{w^{(i)},\, i=1,\dots,n_d} \; \sum_{i=1}^{n_d} \|X^{(i)} w^{(i)} - Y^{(i)}\|_2^2
+ \theta \sum_{i=1}^{n_d} \sum_{j=1}^{n_d} A_{ij} \|w^{(i)} - w^{(j)}\|_2^2
+ \lambda \sum_{i=1}^{n_d} \|w^{(i)}\|_2^2 \tag{2}
$$

where $\theta$ is a balance parameter to control the
importance of domain relations, and $\lambda$ is a
regularization parameter. From the above objective function we
can see that, if the i-th domain and j-th domain are closely
related, i.e., $A_{ij}$ is a large positive number, it
encourages a smaller Euclidean distance between $w^{(i)}$ and
$w^{(j)}$. The intuition is that a given feature would have a
similar effect in predicting the impact of papers from two
similar or closely related domains.

iBall - non-linear formulation: As indicated in our empirical
studies (Figure 3), the relationship between the features and
the output (citation counts in ten years) is far beyond
linear. Thus, we further develop the kernelized counterpart of
the above linear model. Let us denote the kernel matrix of
papers in the i-th domain by $K^{(i)}$, which can be computed
as $K^{(i)}(a,b) = k(X^{(i)}(a,:), X^{(i)}(b,:))$, where
$k(\cdot,\cdot)$ is a kernel function that implicitly computes
the inner product in a high-dimensional reproducing kernel
Hilbert space (RKHS). Similarly, we define the cross-domain
kernel matrix by $K^{(ij)}$, which can be computed as
$K^{(ij)}(a,b) = k(X^{(i)}(a,:), X^{(j)}(b,:))$, reflecting
the similarity between papers in the i-th domain and j-th
domain. Different from the linear model, where the model
parameters in different domains share the same dimensionality
(i.e., the dimensionality of the raw features), in the
non-linear case the dimensionality of the model parameters is
the same as the number of training samples in each domain,
which is very likely to differ across domains. Thus, we cannot
use the same distance function for $g(\cdot)$. To address this
issue, the key is to realize that the predicted value of a
test sample using kernel methods is a linear combination of
the similarities between the test sample and all the training
samples. Therefore, instead of restricting the model
parameters to be similar, we require that the predicted value
of a test sample using the training samples in its own domain
and that using the training samples in a closely related
domain be similar. The resulting non-linear model can be
formulated as follows:

$$
\min_{w^{(i)},\, i=1,\dots,n_d} \; \sum_{i=1}^{n_d} \|K^{(i)} w^{(i)} - Y^{(i)}\|_2^2
+ \theta \sum_{i=1}^{n_d} \sum_{j=1}^{n_d} A_{ij} \|K^{(i)} w^{(i)} - K^{(ij)} w^{(j)}\|_2^2
+ \lambda \sum_{i=1}^{n_d} w^{(i)\prime} K^{(i)} w^{(i)} \tag{3}
$$

where $\theta$ is a balance parameter to control the
importance of domain relations, and $\lambda$ is a
regularization parameter. From the above objective function we
can see that, if the i-th domain and j-th domain are closely
related, i.e., $A_{ij}$ is a large positive number, the
predicted value of papers in the i-th domain computed using
training samples from the i-th domain ($K^{(i)} w^{(i)}$)
should be similar to that using training samples from the j-th
domain ($K^{(ij)} w^{(j)}$).

4.2 iBall - Closed-form Solutions 

It turns out that both the iBall linear and non-linear
formulations have the following closed-form solution (see the
proof in Subsection 4.4):

$$
w = S^{-1} Y \tag{4}
$$

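iBall linear formulation. In the linear case, we have that
$w = [w^{(1)}; \dots; w^{(n_d)}]$,
$Y = [X^{(1)\prime} Y^{(1)}; \dots; X^{(n_d)\prime} Y^{(n_d)}]$,
and $S$ is a block matrix composed of $n_d \times n_d$ blocks,
each of size $d \times d$, where $d$ is the feature
dimensionality. $S$ can be computed as follows, where the
first displayed block lies in the i-th block row and i-th
block column and the second in the i-th block row and j-th
block column:

$$
S = \begin{bmatrix}
 & \vdots & & \vdots & \\
\cdots & X^{(i)\prime} X^{(i)} + \left(\theta \sum_{j=1}^{n_d} A_{ij} + \lambda\right) I & \cdots & -\theta A_{ij} I & \cdots \\
 & \vdots & & \vdots &
\end{bmatrix} \tag{5}
$$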
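To make the block structure of Eqs. (4) and (5) concrete, here is a minimal NumPy sketch on synthetic data. The function name `iball_linear`, the toy sizes and the parameter values are assumptions for illustration only; the check at the end of the lead-in is against the stationarity condition derived in Subsection 4.4.

```python
import numpy as np

def iball_linear(Xs, Ys, A, theta, lam):
    """Solve S w = Y with S and Y assembled block by block as in Eqs. (4)-(5)."""
    nd, d = len(Xs), Xs[0].shape[1]
    S = np.zeros((nd * d, nd * d))
    rhs = np.zeros(nd * d)
    for i in range(nd):
        bi = slice(i * d, (i + 1) * d)
        # Diagonal block: X'X + (theta * sum_j A_ij + lambda) I
        S[bi, bi] += Xs[i].T @ Xs[i] + (theta * A[i].sum() + lam) * np.eye(d)
        rhs[bi] = Xs[i].T @ Ys[i]
        for j in range(nd):
            bj = slice(j * d, (j + 1) * d)
            # Coupling block: -theta * A_ij I (a nonzero A_ii simply cancels out)
            S[bi, bj] -= theta * A[i, j] * np.eye(d)
    return np.linalg.solve(S, rhs).reshape(nd, d)

# Synthetic toy problem: 3 domains, 3 features per paper.
rng = np.random.default_rng(0)
sizes = (30, 40, 25)
Xs = [rng.random((n, 3)) for n in sizes]
Ys = [rng.random(n) for n in sizes]
A = np.array([[0.0, 1.0, 0.2],
              [1.0, 0.0, 0.1],
              [0.2, 0.1, 0.0]])
theta, lam = 0.5, 0.1
W = iball_linear(Xs, Ys, A, theta, lam)   # row i holds w^(i)
```

The sketch solves the linear system directly instead of forming $S^{-1}$, which gives the same $w$ and is the usual numerical choice.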

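iBall non-linear formulation. In the non-linear case, we have
that $w = [w^{(1)}; \dots; w^{(n_d)}]$,
$Y = [Y^{(1)}; \dots; Y^{(n_d)}]$, and $S$ is a block matrix
composed of $n_d \times n_d$ blocks with the (i, j)-th block
of size $n_i \times n_j$, where $n_i$ is the number of
training samples in the i-th domain. $S$ can be computed as
follows, where the first displayed block lies in the i-th
block row and i-th block column and the second in the i-th
block row and j-th block column:

$$
S = \begin{bmatrix}
 & \vdots & & \vdots & \\
\cdots & \left(1 + \theta \sum_{j=1}^{n_d} A_{ij}\right) K^{(i)} + \lambda I & \cdots & -\theta A_{ij} K^{(ij)} & \cdots \\
 & \vdots & & \vdots &
\end{bmatrix} \tag{6}
$$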
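A minimal sketch of the non-linear closed form follows, assuming an RBF kernel (the kernel choice, `gamma`, the helper names `rbf`, `build_S`, `iball_kernel` and the toy data are all assumptions for illustration). It assembles $S$ from Eq. (6), solves Eq. (4), and predicts for new papers as a linear combination of kernel similarities to the training papers.

```python
import numpy as np

def rbf(Xa, Xb, gamma=1.0):
    """k(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows."""
    sq = (Xa**2).sum(1)[:, None] + (Xb**2).sum(1)[None, :] - 2 * Xa @ Xb.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def build_S(Xs, A, theta, lam, kernel=rbf):
    """Assemble S of Eq. (6); block (i, j) has size n_i x n_j."""
    sizes = [X.shape[0] for X in Xs]
    offs = np.concatenate([[0], np.cumsum(sizes)])
    S = np.zeros((offs[-1], offs[-1]))
    for i, Xi in enumerate(Xs):
        bi = slice(offs[i], offs[i + 1])
        S[bi, bi] += (1 + theta * A[i].sum()) * kernel(Xi, Xi) + lam * np.eye(sizes[i])
        for j, Xj in enumerate(Xs):
            bj = slice(offs[j], offs[j + 1])
            S[bi, bj] -= theta * A[i, j] * kernel(Xi, Xj)   # cross-domain block
    return S, offs

def iball_kernel(Xs, Ys, A, theta, lam):
    """Return one parameter vector w^(i) per domain (length n_i)."""
    S, offs = build_S(Xs, A, theta, lam)
    w = np.linalg.solve(S, np.concatenate(Ys))
    return [w[offs[i]:offs[i + 1]] for i in range(len(Xs))]

# Synthetic toy problem with two related domains of different sizes.
rng = np.random.default_rng(0)
Xs = [rng.random((8, 3)), rng.random((6, 3))]
Ys = [rng.random(8), rng.random(6)]
A = np.array([[0.0, 1.0], [1.0, 0.0]])
ws = iball_kernel(Xs, Ys, A, theta=0.5, lam=0.1)

# Prediction for two new papers in domain 0.
x_test = rng.random((2, 3))
y_hat = rbf(x_test, Xs[0]) @ ws[0]
```

With `theta=0` the blocks decouple and each domain reduces to an independent kernel ridge regression, which is a convenient sanity check.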


4.3 iBall - Scale-up with Dynamic Update 

The major computational cost of the closed-form solutions lies
in the matrix inverse $S^{-1}$. In the linear case, the size
of $S$ is $(d n_d) \times (d n_d)$, so its computational cost
is manageable. However, this is not the case for the
non-linear closed-form solution, since the matrix $S$ in
Eq. (6) is of size $n \times n$, where
$n = \sum_{i=1}^{n_d} n_i$ is the number of all the training
samples. It would be very expensive to store this dense matrix
($O(n^2)$ space) and to compute its inverse ($O(n^3)$ time),
especially when the number of training samples is very large
and the model receives new training examples constantly over
time (dynamic update). In this subsection, we devise an
efficient algorithm to scale up the non-linear closed-form
solution and efficiently update the model to accommodate the
new training samples over time. The key idea of the iBall
algorithm is to replace $S$ with its low-rank approximation so
as to avoid the matrix inversion, and at each time step to
efficiently update the low-rank approximation itself.


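After new papers in all the domains are seen at time step
t + 1, the new $S_{t+1}$ computed by Eq. (6) becomes (first
displayed block: i-th block row and i-th block column; second:
i-th block row and j-th block column):

$$
S_{t+1} = \begin{bmatrix}
 & \vdots & & \vdots & \\
\cdots & \left(1 + \theta \sum_{j=1}^{n_d} A_{ij}\right) K_{t+1}^{(i)} + \lambda I & \cdots & -\theta A_{ij} K_{t+1}^{(ij)} & \cdots \\
 & \vdots & & \vdots &
\end{bmatrix} \tag{7}
$$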

where $K_{t+1}^{(i)}$ is the new within-domain kernel matrix
for the i-th domain and $K_{t+1}^{(ij)}$ is the new
cross-domain kernel matrix for the i-th and j-th domains. The
two new kernel matrices can be computed as follows:

$$
K_{t+1}^{(i)} = \begin{bmatrix} K_t^{(i)} & (k_{t+1}^{(i)})' \\ k_{t+1}^{(i)} & h_{t+1}^{(i)} \end{bmatrix},
\qquad
K_{t+1}^{(ij)} = \begin{bmatrix} K_t^{(ij)} & k_{t+1}^{(ij^*)} \\ k_{t+1}^{(i^*j)} & h_{t+1}^{(i^*j^*)} \end{bmatrix} \tag{8}
$$

where $k_{t+1}^{(i)}$ is the matrix characterizing the
similarity between new training samples and old training
samples and can be computed as
$k_{t+1}^{(i)}(a,b) = k(x_{t+1}^{(i)}(a,:), X_t^{(i)}(b,:))$;
$h_{t+1}^{(i)}$ is the similarity matrix among new training
samples and can be computed as
$h_{t+1}^{(i)}(a,b) = k(x_{t+1}^{(i)}(a,:), x_{t+1}^{(i)}(b,:))$.
$k_{t+1}^{(i^*j)}$ is the matrix characterizing the similarity
between new training samples in the i-th domain and old
training samples in the j-th domain and can be computed as
$k_{t+1}^{(i^*j)}(a,b) = k(x_{t+1}^{(i)}(a,:), X_t^{(j)}(b,:))$.
Similarly, $k_{t+1}^{(ij^*)}$ measures the similarity between
old training samples in the i-th domain and new training
samples in the j-th domain and can be computed as
$k_{t+1}^{(ij^*)}(a,b) = k(X_t^{(i)}(a,:), x_{t+1}^{(j)}(b,:))$;
$h_{t+1}^{(i^*j^*)}$ is the similarity matrix between new
training samples from both the i-th and j-th domains and is
computed as
$h_{t+1}^{(i^*j^*)}(a,b) = k(x_{t+1}^{(i)}(a,:), x_{t+1}^{(j)}(b,:))$.
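The block structure of Eq. (8) means only the kernel entries involving new samples need to be computed at time t + 1. A minimal sketch, assuming an RBF kernel and hypothetical helper names (`grow_within`, `grow_cross`), with synthetic data:

```python
import numpy as np

def rbf(Xa, Xb, gamma=1.0):
    """k(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows."""
    sq = (Xa**2).sum(1)[:, None] + (Xb**2).sum(1)[None, :] - 2 * Xa @ Xb.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def grow_within(K_old, X_old, x_new, kernel=rbf):
    """Extend the within-domain kernel matrix K_t^{(i)} to K_{t+1}^{(i)}."""
    k = kernel(x_new, X_old)          # new vs. old samples
    h = kernel(x_new, x_new)          # new vs. new samples
    return np.block([[K_old, k.T], [k, h]])

def grow_cross(K_old, Xi_old, xi_new, Xj_old, xj_new, kernel=rbf):
    """Extend the cross-domain kernel matrix K_t^{(ij)} to K_{t+1}^{(ij)}."""
    k_old_new = kernel(Xi_old, xj_new)    # old in domain i vs. new in domain j
    k_new_old = kernel(xi_new, Xj_old)    # new in domain i vs. old in domain j
    h = kernel(xi_new, xj_new)            # new in both domains
    return np.block([[K_old, k_old_new], [k_new_old, h]])

rng = np.random.default_rng(0)
Xi, xi = rng.random((5, 3)), rng.random((2, 3))   # domain i: 5 old, 2 new
Xj, xj = rng.random((4, 3)), rng.random((3, 3))   # domain j: 4 old, 3 new
K_i_next = grow_within(rbf(Xi, Xi), Xi, xi)
K_ij_next = grow_cross(rbf(Xi, Xj), Xi, xi, Xj, xj)
```

Both results coincide with recomputing the kernel on the stacked feature matrices, while reusing the old blocks unchanged.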

Given that $S_t$ is a symmetric matrix, we can approximate it
using its top-r eigen-decomposition as
$S_t \approx U_t \Lambda_t U_t'$, where $U_t$ is an
$n \times r$ orthogonal matrix and $\Lambda_t$ is an
$r \times r$ diagonal matrix with the largest r eigenvalues of
$S_t$ on the diagonal. If we can directly update the
eigen-decomposition of $S_{t+1}$ after seeing the new training
samples from all the domains, we can efficiently compute the
new model parameters as follows:

$$
w_{t+1} = S_{t+1}^{-1} Y_{t+1} = U_{t+1} \Lambda_{t+1}^{-1} U_{t+1}' Y_{t+1} \tag{9}
$$

where
$Y_{t+1} = [Y_t^{(1)}; y_{t+1}^{(1)}; \dots; Y_t^{(n_d)}; y_{t+1}^{(n_d)}]$.
Here, $\Lambda_{t+1}^{-1}$ is an $r \times r$ diagonal matrix
whose diagonal entries are the reciprocals of the
corresponding eigenvalues in $\Lambda_{t+1}$. In this way, we
avoid the computationally costly matrix inverse in the
closed-form solution.
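The computation in Eq. (9) can be sketched as below. This is an illustration on a synthetic symmetric positive definite matrix; `lowrank_solve` is a hypothetical helper, and it computes the eigen-decomposition from scratch with `numpy.linalg.eigh` rather than updating it incrementally.

```python
import numpy as np

def lowrank_solve(S, Y, r):
    """Approximate S^{-1} Y using the r largest eigenpairs of symmetric S."""
    vals, vecs = np.linalg.eigh(S)        # eigenvalues in ascending order
    U, lam = vecs[:, -r:], vals[-r:]      # keep the top-r eigenpairs
    # U diag(1/lam) U' Y, evaluated right to left so no n x n matrix is formed
    return U @ ((U.T @ Y) / lam)

rng = np.random.default_rng(0)
G = rng.random((20, 20))
S = G @ G.T + np.eye(20)                  # synthetic symmetric positive definite S
Y = rng.random(20)

w_full = lowrank_solve(S, Y, r=20)        # r = n recovers the exact solution
w_top5 = lowrank_solve(S, Y, r=5)         # rank-5 approximation
```

Only the product with the diagonal of reciprocals is needed, so the cost per solve is $O(nr)$ once the eigenpairs are available.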

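Comparing $S_{t+1}$ with $S_t$, we find that $S_{t+1}$ can be
obtained by inserting into $S_t$, at the right positions, some
rows and columns of the kernel matrices involving new training
samples, i.e., $k_{t+1}^{(i)}$, $h_{t+1}^{(i)}$,
$k_{t+1}^{(i^*j)}$, $k_{t+1}^{(ij^*)}$, $h_{t+1}^{(i^*j^*)}$.
From this perspective, $S_{t+1}$ can be seen as the sum of the
following two matrices (in each, the first displayed block
lies in the i-th block row and i-th block column and the
second in the i-th block row and j-th block column):

$$
\tilde{S}_t = \begin{bmatrix}
 & \vdots & & \vdots & \\
\cdots & \begin{bmatrix} \alpha_i K_t^{(i)} + \lambda I & 0 \\ 0 & 0 \end{bmatrix} & \cdots & \begin{bmatrix} -\theta A_{ij} K_t^{(ij)} & 0 \\ 0 & 0 \end{bmatrix} & \cdots \\
 & \vdots & & \vdots &
\end{bmatrix}
$$

$$
\Delta S = \begin{bmatrix}
 & \vdots & & \vdots & \\
\cdots & \begin{bmatrix} 0 & \alpha_i (k_{t+1}^{(i)})' \\ \alpha_i k_{t+1}^{(i)} & \alpha_i h_{t+1}^{(i)} + \lambda I \end{bmatrix} & \cdots & \begin{bmatrix} 0 & -\theta A_{ij} k_{t+1}^{(ij^*)} \\ -\theta A_{ij} k_{t+1}^{(i^*j)} & -\theta A_{ij} h_{t+1}^{(i^*j^*)} \end{bmatrix} & \cdots \\
 & \vdots & & \vdots &
\end{bmatrix}
$$

$$
S_{t+1} = \tilde{S}_t + \Delta S \tag{10}
$$

where we denote $1 + \theta \sum_{j=1}^{n_d} A_{ij}$ by
$\alpha_i$. The top-r eigen-decomposition of $\tilde{S}_t$ can
be directly written out from that of $S_t$ as
$\tilde{S}_t \approx \tilde{U}_t \Lambda_t \tilde{U}_t'$,
where $\tilde{U}_t$ can be obtained by inserting into $U_t$
rows of 0 at the same row positions where we insert the new
kernel matrices into $S_t$. We propose Algorithm 1 to update
the eigen-decomposition of $S_{t+1}$, based on the observation
that $S_{t+1}$ can be viewed as $\tilde{S}_t$ perturbed by a
low-rank matrix $\Delta S$. In line 5 of Algorithm 1, the only
difference between the partial QR decomposition and the
standard one is that, since $\tilde{U}_t$ is already
orthogonal, we only need to perform the Gram-Schmidt procedure
starting from the first column of $P$.

Algorithm 2: iBall - scale-up with dynamic update

Input: (1) eigen pair of $S_t$: $U_t$, $\Lambda_t$;
(2) feature matrices of new papers in each domain:
$x_{t+1}^{(i)}$, $i = 1, \dots, n_d$;
(3) citation count vectors of new papers in each domain:
$y_{t+1}^{(i)}$, $i = 1, \dots, n_d$;
(4) adjacency matrix of domain relation graph $A$;
(5) balance parameters $\theta$, $\lambda$
Output: (1) updated model parameters $w_{t+1}$, (2) eigen pair
of $S_{t+1}$: $U_{t+1}$, $\Lambda_{t+1}$

1 Update the eigen-decomposition of $S_{t+1}$ using
Algorithm 1 as: $S_{t+1} \approx U_{t+1} \Lambda_{t+1} U_{t+1}'$;
2 Compute the new model parameters:
$w_{t+1} = U_{t+1} \Lambda_{t+1}^{-1} U_{t+1}' Y_{t+1}$;
3 Return: $w_{t+1}$, $U_{t+1}$ and $\Lambda_{t+1}$.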
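The QR-based core of Algorithm 1 (its lines 5 to 8) can be sketched as follows, under two stated assumptions: the eigen-decomposition $\Delta S = P \Sigma P'$ is already given, and NumPy's standard reduced QR is used in place of the partial QR described in the text (it yields the same subspace but does not exploit the orthogonality of the padded $U_t$). The helper name `eigen_update` and the toy inputs are hypothetical.

```python
import numpy as np

def eigen_update(U_pad, lam, P, sig):
    """Eigenpairs of U_pad diag(lam) U_pad' + P diag(sig) P' without forming it."""
    # [U_pad, P] = Q R, with Q having orthonormal columns
    Q, R = np.linalg.qr(np.hstack([U_pad, P]))
    # Small (r + r') x (r + r') core matrix Z = R [Lambda 0; 0 Sigma] R'
    Z = R @ np.diag(np.concatenate([lam, sig])) @ R.T
    L, V = np.linalg.eigh(Z)              # full eigen-decomposition of Z
    return Q @ V, L                       # new eigenvectors and eigenvalues

rng = np.random.default_rng(0)
n, r, r2 = 12, 3, 2
U_pad, _ = np.linalg.qr(rng.standard_normal((n, r)))   # stand-in for padded U_t
lam = np.array([5.0, 3.0, 2.0])                        # stand-in for Lambda_t
P, _ = np.linalg.qr(rng.standard_normal((n, r2)))      # eigenvectors of Delta S
sig = np.array([0.7, -0.3])                            # eigenvalues of Delta S

U_new, lam_new = eigen_update(U_pad, lam, P, sig)
```

All dense work happens on the small matrix `Z`, so the $n \times n$ matrix is never built; the sketch requires $n \ge r + r'$.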


Algorithm 1: Eigen update of $S_{t+1}$

Input: (1) eigen pair of $S_t$: $U_t$, $\Lambda_t$;
(2) feature matrices of new papers in each domain:
$x_{t+1}^{(i)}$, $i = 1, \dots, n_d$;
(3) adjacency matrix of domain relation graph $A$;
(4) balance parameters $\theta$, $\lambda$
Output: eigen pair of $S_{t+1}$: $U_{t+1}$, $\Lambda_{t+1}$

1 Obtain $\tilde{U}_t$ by inserting into $U_t$ rows of 0 at
the right positions;
2 Compute $k_{t+1}^{(i)}$, $h_{t+1}^{(i)}$, $k_{t+1}^{(i^*j)}$,
$k_{t+1}^{(ij^*)}$, $h_{t+1}^{(i^*j^*)}$ for
$i = 1, \dots, n_d$, $j = 1, \dots, n_d$;
3 Construct sparse matrix $\Delta S$;
4 Perform eigen decomposition of $\Delta S$:
$\Delta S = P \Sigma P'$;
5 Perform partial QR decomposition of $[\tilde{U}_t, P]$:
$[\tilde{U}_t, \Delta Q] R \leftarrow QR(\tilde{U}_t, P)$;
6 Set $Z = R [\Lambda_t \; 0; \; 0 \; \Sigma] R'$;
7 Perform full eigen decomposition of $Z$: $Z = V L V'$;
8 Set $U_{t+1} = [\tilde{U}_t, \Delta Q] V$ and
$\Lambda_{t+1} = L$;
9 Return: $U_{t+1}$, $\Lambda_{t+1}$.

Building upon Algorithm 1, we have the fast iBall algorithm
(Algorithm 2) for scaling up the non-linear solution with
dynamic model update.

4.4 iBall - Proofs and Analysis

In this subsection, we provide some analysis regarding the
optimality, the approximation quality and the computational
complexity of our proposed algorithms.

A - Correctness of the closed-form solutions of the iBall
linear and non-linear formulations: In Lemma 1 we prove that
the closed-form solution given in Eq. (4) with $S$ computed by
Eq. (5) is the fixed-point solution to the linear formulation
in Eq. (2), and that the closed-form solution given in Eq. (4)
with $S$ computed by Eq. (6) is the fixed-point solution to
the non-linear formulation in Eq. (3).

Lemma 1. (Correctness of closed-form solution of the iBall
linear and non-linear formulations.) For the closed-form
solution given in Eq. (4), if $S$ is computed by Eq. (5), it
is the fixed-point solution to the objective function in
Eq. (2); and if $S$ is computed by Eq. (6), it is the
fixed-point solution to the objective function in Eq. (3).

Proof. Taking the partial derivative of the objective function
(denoted by $J$) in Eq. (2) w.r.t. $w^{(i)}$, we get

$$
\frac{\partial J}{\partial w^{(i)}} = 2 X^{(i)\prime} X^{(i)} w^{(i)} - 2 X^{(i)\prime} Y^{(i)}
+ \sum_{j=1}^{n_d} 2\theta A_{ij} (w^{(i)} - w^{(j)}) + 2\lambda w^{(i)} \tag{11}
$$

Now, the derivative of $J$ w.r.t. all the parameters $w$ can
be computed as:

$$
\frac{\partial J}{\partial w} =
\begin{bmatrix} \frac{\partial J}{\partial w^{(1)}} \\ \vdots \\ \frac{\partial J}{\partial w^{(n_d)}} \end{bmatrix}
= \begin{bmatrix}
2 X^{(1)\prime} X^{(1)} w^{(1)} - 2 X^{(1)\prime} Y^{(1)} + \sum_{j=1}^{n_d} 2\theta A_{1j} (w^{(1)} - w^{(j)}) + 2\lambda w^{(1)} \\
\vdots \\
2 X^{(n_d)\prime} X^{(n_d)} w^{(n_d)} - 2 X^{(n_d)\prime} Y^{(n_d)} + \sum_{j=1}^{n_d} 2\theta A_{n_d j} (w^{(n_d)} - w^{(j)}) + 2\lambda w^{(n_d)}
\end{bmatrix} \tag{12}
$$

Setting the above derivative to 0, with some rearrangement we
get

$$
S w = Y \tag{13}
$$

Therefore, $w = S^{-1} Y$.

A similar procedure can be applied to get the closed-form
solution for the non-linear formulation. We omit the
derivation for brevity. □

B - Correctness of the eigen update of $S_{t+1}$: The critical
part of Algorithm 2 is the subroutine Algorithm 1 for updating
the eigen-decomposition of $S_{t+1}$. According to Lemma 2,
the only place where approximation error occurs is the initial
eigen-decomposition of $S_0$. The eigen updating procedure
does not introduce additional error.

Lemma 2. (Correctness of Algorithm 1.) If
$S_t = U_t \Lambda_t U_t'$ holds, Algorithm 1 gives the exact
eigen-decomposition of $S_{t+1}$.

Proof. Omitted for brevity. See [16] for details. □

C - Approximation Quality: We analyze the approximation
quality of Algorithm 2 to see how much the learned model
parameters deviate from the parameters learned using the exact
iBall non-linear formulation. The result is summarized in
Theorem 1.

D - Complexities: Finally, we analyze the complexities of
Algorithm 1 and Algorithm 2. In terms of time complexity, the
savings are two-fold: (1) we only need to compute the kernel
matrices involving new training samples; (2) we avoid the
time-consuming large matrix inverse operation. In terms of
space complexity, we do not need to maintain the huge $S_t$
matrix, but instead store its top-r eigen pairs, which take
only $O(nr)$ space.


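Theorem 1. (Error bound of Algorithm 2.) In Algorithm 2, if
$\frac{\sum_{i \notin H} \lambda_t^{(i)}}{\sum_i \lambda_{t+1}^{(i)}} < 1$,
the error of the learned model parameters is bounded by:

$$
\|w_{t+1} - \hat{w}_{t+1}\|_2 \le
\frac{\sum_{i \notin H} \lambda_t^{(i)}}{\left(\sum_i \lambda_{t+1}^{(i)}\right)^2 (1 - \delta)} \, \|Y_{t+1}\|_2 \tag{14}
$$

where $w_{t+1}$ is the model parameter learned by the exact
iBall non-linear formulation at time t + 1, $\hat{w}_{t+1}$ is
the updated model parameter output by Algorithm 2 from time t
to t + 1, $\lambda_t^{(i)}$ and $\lambda_{t+1}^{(i)}$ are the
i-th largest eigenvalues of $S_t$ and $S_{t+1}$ respectively,
$\delta = \|(U_t \Lambda_t U_t' + \Delta S)^{-1} (S_t - U_t \Lambda_t U_t')\|_F$,
and $H$ is the set of integers between 1 and r, i.e.,
$H = \{a \mid a \in [1, r]\}$.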

Proof. Suppose we know the exact St at time t and its 
top-r approximation: St = U t AtU(. After one time step, 
we can construct AS and the exact St+i can be computed 
as St-i-i = St + AS. The model parameters learned by the 
exact non-linear model is: 


w t+ i 


= srAYt+i 

= (St + AS^Y, 


t+i 


(15) 


If we allow approximation as in Algorithm [2] the approx¬ 
imated model parameter is: 


wt+i = St^Yt+i 

= (UtAtUj + ASJ^Yt+i 


(16) 


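Denote $S_t + \Delta S$ by $B$ and $U_t \Lambda_t U_t' + \Delta S$ by $C$; we have
the following:

$$
\|B - C\|_F = \|S_t - U_t \Lambda_t U_t'\|_F
= \Big\| \sum_{i \notin H} \lambda_t^{(i)} u_i u_i' \Big\|_F
\le \sum_{i \notin H} \lambda_t^{(i)}
\tag{17}
$$

where the last inequality is due to the following fact:

$$
\begin{aligned}
\Big\| \sum_{i \notin H} \lambda_t^{(i)} u_i u_i' \Big\|_F
&= \sqrt{\mathrm{tr}\Big( \sum_{i \notin H} \big(\lambda_t^{(i)}\big)^2 u_i u_i' u_i u_i' \Big)} \\
&= \sqrt{\sum_{i \notin H} \big(\lambda_t^{(i)}\big)^2 \, \mathrm{tr}(u_i u_i')} \\
&= \sqrt{\sum_{i \notin H} \big(\lambda_t^{(i)}\big)^2} \\
&\le \sum_{i \notin H} \lambda_t^{(i)}
\end{aligned}
\tag{18}
$$

Denote $\|C^{-1}(B - C)\|_F$ by $\delta$; we know that

$$
\delta \le \|C^{-1}\|_F \, \|B - C\|_F
\le \frac{\sum_{i \notin H} \lambda_t^{(i)}}{\sum_i \lambda_{t+1}^{(i)}} < 1
\tag{19}
$$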


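From matrix perturbation theory [11], we will reach the
following:

$$
\begin{aligned}
\|w_{t+1} - \hat{w}_{t+1}\|_2
&= \|B^{-1} Y_{t+1} - C^{-1} Y_{t+1}\|_2 \\
&\le \|B^{-1} - C^{-1}\|_F \, \|Y_{t+1}\|_2 \\
&\le \frac{\|C^{-1}\|_F^2 \, \|B - C\|_F}{1 - \delta} \, \|Y_{t+1}\|_2 \\
&\le \frac{\sum_{i \notin H} \lambda_t^{(i)}}
          {\left(\sum_i \lambda_{t+1}^{(i)}\right)^2 (1 - \delta)} \, \|Y_{t+1}\|_2
\end{aligned}
\tag{20}
$$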
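
As a numerical illustration of the fact used in the proof (the Frobenius norm of the part discarded by a top-r eigen-approximation equals the square root of the sum of squared discarded eigenvalues, and is therefore at most their sum), the following is a minimal sketch. It assumes NumPy and a synthetic symmetric positive semi-definite matrix; the function name `truncation_residual` and the test matrix are hypothetical and are not taken from the paper.

```python
import numpy as np

def truncation_residual(S, r):
    """Compare ||S - U_r L_r U_r'||_F with the discarded eigenvalues of S.

    S is assumed symmetric positive semi-definite (an assumption of this sketch).
    Returns (residual Frobenius norm, l2 norm of discarded eigenvalues,
    sum of discarded eigenvalues).
    """
    eigvals, eigvecs = np.linalg.eigh(S)      # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Top-r reconstruction U_r diag(lambda_1..lambda_r) U_r'
    S_top_r = (eigvecs[:, :r] * eigvals[:r]) @ eigvecs[:, :r].T
    residual_norm = np.linalg.norm(S - S_top_r, ord="fro")
    discarded = eigvals[r:]
    return residual_norm, np.sqrt(np.sum(discarded ** 2)), np.sum(discarded)

# Synthetic PSD matrix (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 8))
S = A @ A.T + 0.1 * np.eye(50)

residual_norm, l2_tail, l1_tail = truncation_residual(S, r=5)
print(residual_norm, l2_tail, l1_tail)
```

The first two printed values should agree up to rounding, and both should be no larger than the third.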


Theorem 2. (Complexities of Algorithm 1 and Algorithm 2)
Algorithm 1 takes $O((n + m)(r^2 + r'^2))$ time and $O((n +
m)(r + r'))$ space. Algorithm 2 also takes $O((n + m)(r^2 + r'^2))$
time and $O((n + m)(r + r'))$ space, where m is the total number
of new training samples.

Proof. Omitted for brevity. □ 

5. EXPERIMENTS 

In this section, we design and conduct experiments mainly 
to inspect the following aspects: 

• Effectiveness: How accurate are the proposed algorithms
for predicting scholarly entities’ long-term impact?

• Efficiency: How fast are the proposed algorithms? 

5.1 Experiment Setup 

We use the real-world citation network dataset AMiner (see footnote 2)
to evaluate our proposed algorithms. The statistics and empirical
observations are described in Section 2. Our primary
task is to predict a paper’s citations after 10 years given
its citation history in the first three years. Thus, we only
keep papers published between 1936 and 2000 to make
sure they are at least 10 years old. This leaves us with 508,773
papers. Given that the citation distribution is skewed (see
Figure 5), the 10-year citation counts are normalized to the
range of [0, 7]. Our algorithm is also able to predict citation
counts for other scholarly entities, including researchers and
venues. We keep authors whose research careers (marked by the
publication of their first paper) began between 1960 and 2000
and venues founded before 2002. This leaves
us with 315,340 authors and 3,783 venues.

We represent each scholarly entity as a three-dimensional
feature vector, where the i-th dimension is the number
of citations the entity receives in the i-th year after its life
cycle begins (e.g., a paper gets published, a researcher publishes
the first paper). We build a k-nn graph (k = 5) among
different scholarly entities, use METIS [13] to partition the
graph into balanced clusters, and treat each cluster as a domain.
We set the number of domains ($n_d$) to 10 for both
papers and researchers, and to 5 for venues. The Gaussian
kernel matrix of the cluster centroids is used to construct
the domain-domain adjacency matrix A.
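
The feature construction and the centroid-based adjacency matrix described above can be sketched as follows. This is an illustrative sketch only: the cluster labels are assigned by hand here (the paper obtains them by partitioning a k-nn graph with METIS), the kernel parameterization exp(-||x - z||^2 / (2 sigma^2)) and the value of sigma used for the adjacency matrix are assumptions, and all function names are hypothetical.

```python
import math

def early_citation_features(yearly_citations, n_years=3):
    """Citations received in each of the first n_years years, zero-padded."""
    counts = list(yearly_citations[:n_years])
    return counts + [0] * (n_years - len(counts))

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def gaussian_kernel(x, z, sigma):
    """Gaussian kernel value (assumed parameterization, see lead-in)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def domain_adjacency(features, labels, sigma):
    """Gaussian kernel matrix between the centroids of the given clusters."""
    domains = sorted(set(labels))
    centroids = [
        centroid([f for f, lab in zip(features, labels) if lab == d])
        for d in domains
    ]
    return [[gaussian_kernel(ci, cj, sigma) for cj in centroids] for ci in centroids]

# Toy data: four entities with yearly citation counts (hypothetical values).
histories = [[0, 1, 0], [1, 0, 1], [5, 8, 9, 12], [6, 7]]
features = [early_citation_features(h) for h in histories]
labels = [0, 0, 1, 1]          # hand-assigned stand-in for the METIS clusters
A = domain_adjacency(features, labels, sigma=5.1)
print(A)
```

The resulting matrix is symmetric with ones on the diagonal, so closer centroids yield a stronger coupling between their domains.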

To simulate the dynamic scenario where training samples
arrive as a stream, we start with a small initial training set
and add new training samples to it at each time step. The
training samples in each domain are sorted by starting year
(e.g., publication year). In the experiment, for papers, we
start with 0.1% of the data as the initial training set and at each
update add another 0.1% as training samples. The last 10% of samples are

2 http://arnetminer.org/billboard/citation














reserved as test samples, i.e., we always use information from
older publications to predict the latest ones. For
authors, we start with 0.2% as the initial training data, add
another 0.2% at each update, and use the last 10%
for testing. For venues, we start with 20%, add 10% at each
update, and use the last 10% for testing.
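
The streaming protocol above can be sketched with a small generator. This is a minimal sketch under stated assumptions: the samples are already sorted by starting year, the percentages are taken relative to the full sample list (the text does not say whether they are relative to the full set or to the training pool), and the function name `streaming_splits` is hypothetical.

```python
def streaming_splits(samples, init_frac, step_frac, test_frac=0.10):
    """Yield (train, test) pairs for a growing training set.

    samples must be sorted by starting year, oldest first. The newest
    test_frac of the samples is held out as a fixed test set; the training
    set starts at init_frac of all samples and grows by step_frac per step.
    """
    n = len(samples)
    n_test = int(round(n * test_frac))
    train_pool, test = samples[: n - n_test], samples[n - n_test:]
    end = max(1, int(round(n * init_frac)))
    step = max(1, int(round(n * step_frac)))
    while True:
        end = min(end, len(train_pool))
        yield train_pool[:end], test
        if end == len(train_pool):
            break
        end += step

# Venue-style setting from the text: start with 20%, add 10% per update,
# test on the last 10%. The "samples" here are just starting years.
years = list(range(1900, 2000))
splits = list(streaming_splits(years, init_frac=0.20, step_frac=0.10))
sizes = [len(train) for train, _ in splits]
print(sizes)
```

Because the test set is always the newest slice, every training sample in every step is older than every test sample.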

The root mean squared error (RMSE) between the
actual citation and the predicted one is adopted for accuracy
evaluation. All the experiments were performed on a
Windows machine with four 3.5GHz Intel Cores and 256GB
RAM.
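
For reference, the standard definition of RMSE over a test set can be written as below. The symbols n (number of test samples), $y_i$ (actual normalized citation count) and $\hat{y}_i$ (predicted count) are notation introduced here for illustration and may differ from the notation used elsewhere in the paper.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
$$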

Repeatability of Experimental Results: The AMiner citation
dataset is publicly available. We will release the code
of the proposed algorithms through the authors’ website. For
all the results reported in this section, we set θ = λ = 0.01
in our joint predictive model. A Gaussian kernel with σ = 5.1
is used in the non-linear formulations.

5.2 Effectiveness Results 

We perform the effectiveness comparisons of the following 
nine methods: 

1 Predict 0: directly predict 0 for test samples, since the
majority of the papers have 0 citations.

2 Sum of the first 3 years: assume the total number of
citations doesn’t change after three years.

3 Linear-combine: combine training samples of all the
domains for training using a linear regression model.

4 Linear-separate: train a linear regression model for
each domain separately.

5 iBall-linear: jointly learn the linear regression models
as in our linear formulation.

6 Kernel-combine: combine training samples of all the
domains for training using a kernel ridge regression model [18].

7 Kernel-separate: train a kernel ridge regression model
for each domain separately.

8 iBall-kernel: jointly learn the kernel regression models
as in our non-linear formulation.

9 iBall-fast: proposed algorithm for speeding up the
joint non-linear model.
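
To make the baselines concrete, the following sketch shows how ‘Predict 0’, ‘Sum of the first 3 years’, Kernel-combine and Kernel-separate could be set up. It is not the authors' code: it assumes NumPy, uses the textbook closed form of kernel ridge regression (solve (K + lam I) alpha = y), an assumed Gaussian kernel parameterization exp(-||x - z||^2 / (2 sigma^2)), synthetic data, and hypothetical function names. The joint iBall models are not reproduced here.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D arrays."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def gaussian_kernel_matrix(X, Z, sigma):
    """Pairwise Gaussian kernel values between rows of X and rows of Z."""
    sq_dist = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def fit_kernel_ridge(X, y, sigma, lam):
    """Dual coefficients alpha solving (K + lam * I) alpha = y."""
    K = gaussian_kernel_matrix(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict_kernel_ridge(X_train, alpha, X_new, sigma):
    """Predictions k(X_new, X_train) @ alpha."""
    return gaussian_kernel_matrix(X_new, X_train, sigma) @ alpha

def predict_zero(X):
    """'Predict 0' baseline."""
    return np.zeros(len(X))

def predict_sum_first_years(X):
    """'Sum of the first 3 years' baseline: citations stop growing."""
    return X.sum(axis=1)

def kernel_separate(X, y, domains, sigma, lam):
    """One kernel ridge model per domain: {domain: (X_d, alpha_d)}."""
    models = {}
    for d in np.unique(domains):
        mask = domains == d
        models[int(d)] = (X[mask], fit_kernel_ridge(X[mask], y[mask], sigma, lam))
    return models

# Synthetic three-year citation features and targets (hypothetical data).
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(40, 3)).astype(float)
y = np.log1p(X.sum(axis=1)) ** 2
domains = np.arange(40) % 2            # stand-in for the domain assignment

alpha = fit_kernel_ridge(X, y, sigma=5.1, lam=0.01)      # Kernel-combine
pred_combine = predict_kernel_ridge(X, alpha, X, sigma=5.1)
models = kernel_separate(X, y, domains, sigma=5.1, lam=0.01)

print("Predict 0      :", rmse(y, predict_zero(X)))
print("Sum of 3 years :", rmse(y, predict_sum_first_years(X)))
print("Kernel-combine :", rmse(y, pred_combine))
```

The printed errors are training errors on toy data and say nothing about the results in the paper; the sketch only shows how the combined and per-domain variants differ in which samples each model sees.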

A - Overall paper citation prediction performance. The RMSE
results of different methods for test samples from all the domains
are shown in Figure 6. We have the following observations:
(1) The non-linear methods (iBall-fast, iBall-kernel,
Kernel-separate, Kernel-combine) outperform the linear methods
(iBall-linear, Linear-separate, Linear-combine), and the
straightforward ‘Sum of first 3 years’ is much better than
the linear methods, which reflects the complex non-linear
relationship between the features and the impact. (2) The
performance of iBall-fast is very close to that of iBall-kernel and
sometimes even better, which confirms the good approximation
quality of the model update and the possible de-noising
effect offered by the low-rank approximation. (3)
The iBall family of joint models is better than their separate
versions (Kernel-separate, Linear-separate). To evaluate the
statistical significance, we perform a t-test using 1.4% of the
training samples and show the p-values in Table 2. From the
result, we see that the improvement of our method is significant.
To investigate parameter sensitivity, we perform parametric
studies with three parameters in iBall-fast, namely, θ,
λ and r. Figure 9 shows that the proposed method is stable
in a large range of the parameter space.

B - Domain-by-domain paper citation prediction performance.
In Figure 10, we show the RMSE comparison results for
four domains with different total training sizes. iBall-kernel
and its fast version iBall-fast consistently outperform other
methods in all the domains. In the third domain, some linear
methods (Linear-separate and Linear-combine) perform
even worse than the baseline (‘Predict 0’).

C - Prediction error analysis. We visualize the actual citation
vs. the predicted citation using iBall as a heat map
in Figure 11. The (x, y) square shows, among all the test
samples with actual citation y, the percentage that have
predicted citation x. We observe a very bright region near
the x = y diagonal. The prediction error mainly occurs in
a bright strip at x = 1, y > 1. This is probably due to the
delayed high impact of some scientific work, as suggested by
the blue and green lines in Figure 4, i.e., some papers only
attract attention many years after they were published.
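
The quantity plotted in such a heat map can be computed as in the sketch below. It assumes that actual and predicted citation values have already been binned to discrete levels (the binning is not specified in the text); the function name `error_heatmap` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def error_heatmap(actual, predicted):
    """For each actual value y, the percentage of samples predicted as each x.

    Returns {y: {x: percentage}}; the percentages for a fixed y sum to 100.
    """
    counts = defaultdict(Counter)
    for y, x in zip(actual, predicted):
        counts[y][x] += 1
    return {
        y: {x: 100.0 * c / sum(row.values()) for x, c in row.items()}
        for y, row in counts.items()
    }

# Toy binned values (hypothetical): two samples with y > 1 are predicted as 1,
# which is the kind of under-prediction described as the strip at x = 1.
actual = [1, 1, 1, 3, 3, 5]
predicted = [1, 1, 2, 3, 1, 1]
heat = error_heatmap(actual, predicted)
print(heat)
```

Normalizing per actual value y, rather than over all samples, keeps rare high-citation rows visible even though most papers have few citations.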



Figure 11: Prediction error analysis: actual citation vs. predicted
citation (horizontal axis: predicted citation). Best viewed in color.

D - Author and venue citation prediction performance. We
also show the RMSE comparison results for the impact prediction
of authors and venues in Figures 7 and 8, respectively.
The observations are similar to those for paper impact prediction,
except that for venue citation prediction, iBall-linear
achieves performance similar to that of iBall-fast and
iBall-kernel. This is probably because venue
citation prediction (which involves the aggregation of the citations of
all of a venue’s authors and papers) is at a much coarser
granularity, and thus a relatively simple linear model is sufficient
to characterize the correlation between features and
outputs (citation counts).

5.3 Efficiency Results 

A - Running time comparison: We compare the running
time of different methods with different training sizes and
show the result in Figure 12, with time in log scale. All
the linear methods are very fast (< 0.01s) as the feature
dimensionality is only 3. Our iBall-fast outperforms all other
non-linear methods and scales linearly.

B - Quality vs. speed: Finally, we evaluate how the proposed
methods balance prediction quality and
speed. In Figure 13, we show the RMSE vs. running time








Table 2: p-value of statistical significance

               Predict 0   Linear-combine   Linear-separate   iBall-linear   Sum of first 3 years   Kernel-combine   Kernel-separate   iBall-fast
iBall-kernel   0           5.53e-16         6.12e-17          1.16e-13       1.56e-219              1.60e-72         8.22e-30          3.39e-14





Figure 6: Overall paper citation prediction performance comparisons.
Lower is better.

Figure 7: Author citation prediction performance comparison.
Lower is better.

Figure 8: Venue citation prediction performance comparison.
Lower is better.

Figure 12: Comparison of running time of different methods.
The time axis is in log scale.

Figure 13: Quality vs. speed with 88,905 training samples.


of different methods with 88,905 total training samples. For
iBall-fast, we show its results using different rank r for the
low-rank approximation. Clearly, iBall-fast achieves the best
trade-off between quality and speed as its results all lie in
the bottom left corner.


6. RELATED WORK 

In this section, we review the related work. 

Impact/popularity prediction: As a pilot study, Yan
et al. [30, 29] identify effective features for the citation
count prediction problem. Davletov et al. [8] address the
same problem by first clustering papers according to the
temporal change in their citation counts and then assigning
a polynomial to each cluster for regression. In light of the
difficulty posed by the power-law distribution of citations, Dong
et al. [9] instead consider whether a paper can increase the
primary author’s h-index. Yu et al. [33] address predicting
citation relations in heterogeneous bibliographical networks.

A closely related line of work is to predict the popularity of other
online content, e.g., posts, videos, TV series. Yao et al. [31]
predict the long-term impact of questions/answers. Notice
that in terms of methodology, the method in [31] can be conceptually
viewed as a special case of our iBall model when
there are only two domains and the instance-level correspondence
across different domains (e.g., question-answer association)
is known. Li et al. [15] conduct a study on popularity
forecasting of videos shared in social networks. They
consider both the intrinsic attractiveness of a video and the
influence from the underlying diffusion structure. Chang et
al. [5] are the first to comprehensively study
the popularity prediction of online serials with autoregressive models.
As online serials have strong sequence dependence and release
date dependence, they develop an autoregressive model
to capture the dynamic behaviors of audiences. Though the
focus of this paper is to propose a tailored method to predict
long-term citation counts, our method could be naturally
applied to other related applications, e.g., popularity
prediction.

Multi-task learning: Our joint model is also related to
multi-task learning as we jointly learn the models for each
domain (task). Multi-task learning aims to improve the generalization
performance of a learning task with the help of
other related tasks. A key challenge in multi-task learning
is how to exploit the relationship among different tasks
to allow information to be shared across tasks. One way is by
sharing of parameters. In neural networks, hidden units are
shared across tasks [2]. Sharing can also be induced by assuming
that the parameters used by all tasks are close to each
other, by minimizing the Frobenius norms of their differences
in methods based on convex optimization formulations [10].
In Bayesian hierarchical models, parameter sharing can be
imposed by assuming a common prior that the tasks share [32]. A
second way is assuming a common basis of the parameter
space. A low-rank and sparse structure of the underlying
predictive hypothesis has been applied to capture task
relatedness as well as outlier tasks. Our method
is directly applicable when the correlation/similarity among
different tasks is known, and it enjoys a closed-form solution.
In terms of computation, we also provide an efficient way to
track the joint predictive model in the dynamic setting.

Scholarly data mining: Scholarly data can be viewed
as a heterogeneous information network of papers, authors,
venues and terms [22]. Mining of such scholarly data is
often approached from the following perspectives: (1) similarity search, to
find similar scholarly entities given a query entity or a set



































































Figure 9: Sensitivity study on iBall-fast: study the effect of the parameters θ, λ and r in terms of RMSE. Panels: (a) RMSE vs. θ; (b) RMSE vs. λ; (c) RMSE vs. r.




Figure 10: Paper citation prediction performance comparison in four domains. Panels: (a) first domain; (b) second domain; (c) third domain; (d) fourth domain.


of query entities [21, 25, 24]; (2) literature recommendation,
to recommend related research papers on a topic [17, 4];
and (3) co-author collaboration prediction, to predict if two
researchers will collaborate in the future [19, 20].


7. CONCLUSIONS 

In this paper, we propose iBall, a family of algorithms for
the prediction of long-term impact of scientific work given
its citation history in the first few years. The proposed algorithms
collectively address a number of key algorithmic
challenges in impact prediction (i.e., feature design, non-linearity,
domain heterogeneity and dynamics). iBall is flexible
and general in the sense that it can be generalized to both
regression and classification models, and to both linear and
non-linear formulations; it is also scalable and adaptive to new
training data.
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